This report was built with R version 4.0.2.
Welcome to the Dickson Lab’s introduction to studying the lung microbiome. In this tutorial we’ll see how we can use R to perform data manipulation, modeling, visualization, and reporting as part of our analysis workflow. If you have not already done so, please download R, RStudio, and the necessary packages. See this guide for more details.
There are many quality of life features that aid in making your time spent coding more efficient and reproducible. For example, many keyboard shortcuts exist in RStudio for common tasks. Navigate to Help > Keyboard Shortcuts Help, for a full list. There are many other useful features, but a full discussion of these is outside the scope of this presentation.
These following suggested practices will aid you as you begin writing your own code. We’ll also demonstrate these throughout the tutorial.
?function_name, such as ?ggplot to view the help documentation for a function or library.package_name::function_name()
dplyr::select(). It also informs users which library a function is from.%>%. This can drastically improve the efficiency of your time spent coding, and is a natural fit for data wrangling workflows. We’ll see several examples of this in action.We will be using 16S data resulting from mothur processing and metadata that was previously generated. All of this data is available here.
Before we begin bringing in the data to begin our analysis, we need to load the libraries that we’ll use. Libraries contain additional code (primarily in the form of functions) and/or data that extend R’s base functionality. A good practice is to keep a code chunk near the top of your script that contains all the libraries that are utilized in the analysis, so it’s easy to keep track of all of the dependencies that your code has.
# This section contains typical libraries used in these analyses
# I. universal data wrangling and visualization
library(tidyverse) # ggplot, dplyr, purrr, forcats, stringr... and other packages. See https://tidyverse.tidyverse.org/
# II. machine learning & analysis packages
library(vegan) # decostand(), metaMDS(), scores(), rda(), ordispider(), adonis()
# library(mvabund) # mvabund(), manyglm() # This line can be uncommented when using mvabund as part of the 16S analysis.
library(randomForest) # randomForest()
# III. formatting, plotting
library(scales) # used in formatting relative abundance % labels; unit_format()
library(RColorBrewer) # brewer.pal() for generating custom palettes
library(knitr) # kable(), for displaying nicer tables in html/pdf output
Next we’ll bring in the 16S data from two mothur files which have the raw counts and taxonomy information.
# For ease of use, place the processed 16S data files of interest into the same directory as your current RStudio project
# Best practices: 1) keep a separate project for each experiment or combination of experiments
# 2) include a data folder within the project's directory to keep source files organized; also somewhat mitigates the risk of modifying them inadvertently
# I. Read in .shared (counts of sequences assigned to OTU #s)
# Includes basic processing with selection of OTUs > 0.1 % of the population
otu_raw <- read.table("data/Heterogeneity.shared.txt", row.names = 2, header = TRUE, stringsAsFactors = FALSE) # read in the data, treating the first row as column names
otu_trim <- otu_raw[, -c(1:2)] # drop the first 2 columns
otu_trim <- as.matrix(otu_trim)
otu_tmp <- vegan::decostand(otu_trim, "total") * 100 # standardization
otu_trim[which(otu_tmp < 0.1)] <- 0
otu_good <- otu_trim[, which(colSums(otu_trim) > 0)] # keep only the OTUs which were present after setting quantities under 0.1 to 0
# Trim _S from end of rownames(otu_good); note that it may be useful to keep these well identifiers in some analyses
# Also clean up extra underscores and Xs from rownames
rownames(otu_good) <- str_remove(rownames(otu_good), "_S\\d+") %>%
str_replace_all(pattern = "[_X]+", replacement="_") %>%
str_remove(pattern="_$")
# It's a good idea to inspect the data regularly with tools such as dim(), str(), head(), tail(), summary()... This will help in catching errors earlier
dim(otu_good) # dim(otu_good)[1] observations (samples) x dim(otu_good)[2] features (OTUs above .1% population threshold)
## [1] 192 730
otu_good[1:20, 1:10] # Examine first 20 rows, 10 columns. Try using other means for getting a glimpse of the data, such as str(otu_good).
## Otu00001 Otu00002 Otu00003 Otu00004 Otu00005 Otu00006
## Control_01_AE 101 0 1 3 0 1
## Control_02_AE 404 0 179 0 0 0
## Control_03_AE 2 0 0 0 0 1
## Control_04_AE 6165 0 4734 0 0 0
## Control_05_AE 16177 0 2331 0 0 0
## Control_06_AE 38002 0 0 0 0 0
## Control_07_AE 1307 0 133 0 11 2346
## Control_08_AE 11 0 67 3 6 72
## Control_01_Blank 86 0 0 0 0 54
## Control_02_Blank 9763 0 0 1251 0 55
## Control_03_Blank 106 0 1075 0 0 0
## Control_04_Blank 2026 0 487 3 3 0
## Tissue_CharRiv1_10_Cecum 0 0 0 0 0 0
## Tissue_CharRiv1_10_Lung 0 0 1 0 0 0
## Tissue_CharRiv1_10_Nasal 225 0 4515 0 0 0
## Tissue_CharRiv1_10_Tongue 0 37998 0 0 0 0
## Tissue_CharRiv1_01_Cecum 0 0 1169 0 0 0
## Tissue_CharRiv1_01_Lung 27 19 2 9 0 0
## Tissue_CharRiv1_01_Nasal 1028 203 443 0 20631 115
## Tissue_CharRiv1_01_Tongue 0 19405 0 0 0 0
## Otu00007 Otu00008 Otu00009 Otu00010
## Control_01_AE 0 3 0 1
## Control_02_AE 0 1 0 0
## Control_03_AE 0 0 3 0
## Control_04_AE 0 0 0 0
## Control_05_AE 0 0 0 0
## Control_06_AE 0 0 0 0
## Control_07_AE 0 0 0 0
## Control_08_AE 0 1 2 0
## Control_01_Blank 0 0 0 0
## Control_02_Blank 0 0 0 0
## Control_03_Blank 0 0 0 406
## Control_04_Blank 0 3 0 0
## Tissue_CharRiv1_10_Cecum 872 0 104 0
## Tissue_CharRiv1_10_Lung 2 1 0 0
## Tissue_CharRiv1_10_Nasal 0 0 0 0
## Tissue_CharRiv1_10_Tongue 0 0 0 0
## Tissue_CharRiv1_01_Cecum 5643 0 201 0
## Tissue_CharRiv1_01_Lung 0 0 1 3
## Tissue_CharRiv1_01_Nasal 0 0 0 2596
## Tissue_CharRiv1_01_Tongue 0 0 0 0
# II. Read in cons.taxonomy (for each OTU: contains taxonomic identifiers down to Genus level, if possible)
# Note that OTUs are in rows here rather than columns
# 2 imports are required due to the combination of 2 different separators
tax1 <- read.table("data/Heterogeneity.0.03.cons.taxonomy.txt", sep = "\t", row.names = 1, header = TRUE, colClasses = c("character", "numeric", "character"))
tax2 <- read.table("data/Heterogeneity.0.03.cons.taxonomy.txt", sep = ";", skip = 1, col.names = c("", "Phylum", "Class", "Order", "Family", "Genus", ""), stringsAsFactors = FALSE) %>%
dplyr::select(-c(1, 7)) %>% # drop columns 1 and 7
purrr::map_df(stringr::str_replace, "\\(.*.\\)", "") %>% # this essentially trims (100) and the like
purrr::map_df(as.factor) # turns all present columns into factor type variables
otu_taxonomy <- data.frame(OTU = rownames(tax1), Size = tax1[, 1], tax2, stringsAsFactors = FALSE) # combine the first and second import
rownames(otu_taxonomy) <- rownames(tax1)
# trim otu_taxonomy based on OTUs in otu_good
otu_good_taxonomy <- otu_taxonomy[otu_taxonomy$OTU %in% colnames(otu_good), ] %>%
droplevels() # Remove factor levels that have no values remaining due to removal
rownames(otu_good_taxonomy) <- intersect(rownames(otu_taxonomy), colnames(otu_good))
# otu_good_taxonomy <- as.data.frame(otu_good_taxonomy) redundant in this case
dim(otu_good_taxonomy) # Displays number of rows and columns
## [1] 730 7
otu_good_taxonomy # Examine the taxonomy dataframe. The first 10 OTUs coincide with the glimpse we took of otu_good. We're leveraging the df_print: paged option to be able to look through the dataframe in an html output.
In order to aid in dividing our samples into groupings, we’ll also need metadata or environmental data. Here we illustrate how that can be done.
# In this section, we'll load in metadata/environmental data, and/or create it
## Load approach using metadata from github repo
meta_df <- read.table("data/Heterogeneity.metadata.txt", sep = "\t", header = TRUE, stringsAsFactors = TRUE,
colClasses = c("specimen" = "character"))
# Note that this will create factors out of all text (string) columns, except for specimen which we enforced as character type.
# Factors are useful for 'reagent_tissue', 'litter', 'vendor', 'mouse', 'tissue_type', and 'cage' variables.
otu_df <- data.frame(meta_df, decostand(otu_good, "total") * 100, stringsAsFactors = FALSE)
# Examples of metadata creation using regular expressions (regex) in a tidyverse pipeline
# Note that extracting metadata from sample names is much easier if naming conventions are consistent in terms of spelling and placement
# otu_df <- data.frame(decostand(otu_good, "total") * 100, Sample_name = row.names(otu_good), stringsAsFactors = FALSE) %>%
# # Create Specimen_ctrl from first string of characters prior to first "_"
# mutate(Specimen_ctrl = factor(case_when(str_detect(Sample_name, "^AE_") ~ "AE",
# str_detect(Sample_name, "^Blank\\d_") ~ "Empty",
# str_detect(Sample_name, "^Heparin_") ~ "Heparin",
# str_detect(Sample_name, "^HomogCtrl_") ~ "HomogCtrl",
# str_detect(Sample_name, "^IsoCtrl_") ~ "IsoCtrl",
# str_detect(Sample_name, "^LPS_") ~ "LPS",
# str_detect(Sample_name, "^PerfusionPBS_") ~ "PBS",
# str_detect(Sample_name, "^WATERD|^waterC") ~ "WaterNeg",
# str_detect(Sample_name, "^mock[CD]") ~ "Mock",
# str_detect(Sample_name, "^[:alnum:]+") ~ str_extract(Sample_name, "^[:alnum:]+")), # for experimental groups, place after other steps because this regex would fill in values for controls (may or may not be ideal)
# levels = c("CR1", "CR2", "JAX1", "JAX2", "AE", "Empty", "Heparin", "HomogCtrl", "IsoCtrl", "LPS", "PBS", "WaterNeg", "Mock")
# )
# ) %>%
# # Create Organ
# mutate(Organ = factor(case_when(str_detect(Sample_name, "_CECUM_") ~ "Cecum",
# str_detect(Sample_name, "_LUNG_HOMOG") ~ "Lung",
# str_detect(Sample_name, "_NASAL_") ~ "Nasal",
# str_detect(Sample_name, "_TONGUE_") ~ "Tongue"),
# levels = c("Nasal", "Tongue", "Lung", "Cecum")
# ) # sum(is.na(otu_df$Organ)) returns 32 (as expected, controls have NA in this column)
# ) %>%
# # Create Mouse
# mutate(Mouse = as.numeric(str_extract(Sample_name, "(?<=(CR\\d|JAX\\d)_)\\d+") # numbers correspond to mice if a CR or JAX specimen
# ) # sum(is.na(otu_df$Mouse)) returns 32 (as expected, the controls have NA in this column)
# ) %>%
# # metadata columns at front, followed by all of the count data
# dplyr::select(Sample_name:Mouse, everything())
# # note that the data is in wide format
#
# otu_df$Specimen_ctrl %>% table() %>% sum() # 192: all samples/rows are accounted for
# otu_df$Organ %>% summary() %>% sum() # ditto; 32 NAs for organs
Here is a glimpse of the metadata, done by randomly sampling 10 rows:
# The following lines of code randomly sample and display 10 rows from the metadata dataframe
set.seed(2358)
kable(dplyr::sample_n(meta_df, 10), caption = "Metadata, 10 randomly sampled rows", align = "l")
| specimen | reagent_tissue | litter | vendor | mouse | tissue_type | log_IL1a | VEGF | FGF | IL17 | IL2 | MIG | IL4 | cage | IL1a | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 65 | Tissue_CharRiv2_03_Cecum | Tissue | CharRiv2 | Charles_River | CR2_3 | Cecum | 0.6490036 | 107.92794 | 777.5026 | 0.5802547 | 11.120115 | 16.969256 | 15.203014 | CR_2_1 | 4.456599 |
| 99 | Control_02_Isolat | Control | None | None | 2 | Isolat | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 35 | Tissue_CharRiv1_05_Nasal | Tissue | CharRiv1 | Charles_River | CR1_5 | Nasal | 1.0674380 | 55.89532 | 628.9448 | 0.8915714 | 11.297230 | 35.834329 | 12.275795 | CR_1_1 | 11.679871 |
| 19 | Tissue_CharRiv1_01_Nasal | Tissue | CharRiv1 | Charles_River | CR1_1 | Nasal | 1.1128108 | 44.44161 | 475.8391 | 1.2137071 | 16.346751 | 29.096520 | 18.006726 | CR_1_1 | 12.966143 |
| 23 | Tissue_CharRiv1_02_Nasal | Tissue | CharRiv1 | Charles_River | CR1_2 | Nasal | 0.8003082 | 53.83172 | 545.8593 | 0.5802547 | 14.379535 | 15.626977 | 10.432549 | CR_1_1 | 6.314052 |
| 54 | Tissue_CharRiv2_10_Lung | Tissue | CharRiv2 | Charles_River | CR2_10 | Lung | 0.2469431 | 140.09801 | 928.8854 | 0.0000000 | 12.390619 | 6.258374 | 0.000000 | CR_2_2 | 1.765806 |
| 144 | Tissue_Jackson2_10_Lung | Tissue | Jackson2 | Jackson_Labs | JL2_10 | Lung | 0.6219255 | 59.07050 | 434.6783 | 0.0000000 | 8.156213 | 10.266051 | 0.000000 | JL_2_2 | 4.187217 |
| 47 | Tissue_CharRiv1_08_Nasal | Tissue | CharRiv1 | Charles_River | CR1_8 | Nasal | 0.8671251 | 56.52897 | 480.3836 | 0.0000000 | 6.629798 | 18.313165 | 9.534462 | CR_1_2 | 7.364191 |
| 51 | Tissue_CharRiv1_09_Nasal | Tissue | CharRiv1 | Charles_River | CR1_9 | Nasal | 0.6081747 | 46.12535 | 575.6304 | 0.0000000 | 8.701160 | 19.657075 | 6.788356 | CR_1_2 | 4.056717 |
| 132 | Tissue_Jackson1_07_Lung | Tissue | Jackson1 | Jackson_Labs | JL1_7 | Lung | 0.3590540 | 76.71263 | 481.8300 | 0.0000000 | 6.366398 | 10.266051 | 0.000000 | JL_1_2 | 2.285883 |
We’ll also create a custom color palette for use in some of our visualizations. Although we might not realize we need this until later in the analysis, it makes sense to include elements like this in the preparation part of your analysis. Apart from the organizational benefit, code that relies on the palette won’t run if the palette hasn’t been defined - so it is best to define the palette before creating any visualizations.
# for PCA ordinations; choose an 9 that covers the maximum number of factor levels (conditions) within a variable of interest
pcapal <- brewer.pal(9,"Set1")
It’s important to quantify the total bacterial load, typically done as one of the first steps in our analysis.
We’ll show an example using ddPCR, but first we need to load the data.
# The readxl library is installed along with tidyverse, but is not loaded with library(tidyverse). We can however access the read_xlsx function without loading readxl as shown below.
spec_abs <- readxl::read_xlsx("data/16S_EvaGreen__Heterogeneity1_Data.xlsx", sheet = 2) %>%
dplyr::select(Sample_name = `Sample Name`, DNA_copies_per_ul = `Copies/ul DNA`) %>% # Rename the two columns of interest, and retain only these columns
dplyr::filter(str_detect(Sample_name, "LUNG")) # Keep only the lung tissue samples
ctrl_abs <- readxl::read_xlsx("data/16S_EvaGreen__Heterogeneity1Ctrls_030618.xlsx", sheet = 2) %>%
dplyr::select(Sample_name = `Sample Name`, DNA_copies_per_ul = `Copies/ul Sample`) %>% # Rename the two columns of interest, and retain only these columns
drop_na() %>% # drop rows with mising values
data.frame() # Avoid tibble indexing warning for next step (As of tibble 3.0.0)
# Round up 0.57 DNA_copies to 1 (HomogCtrl_2)
ctrl_abs[ctrl_abs[, "DNA_copies_per_ul"] < 1, "DNA_copies_per_ul"] <- 1
abs_df <- full_join(spec_abs, ctrl_abs, by = c("Sample_name", "DNA_copies_per_ul")) %>% # We could also use bind_rows() or rbind() in this scenario since we renamed the columns to match. However, for more complex operations, it's safer to use a join function and specify the key columns.
mutate(Type = factor(case_when( # We need to create a factor containing the types of samples for creating separate bars in an upcoming visualization.
str_detect(Sample_name, "LUNG") ~ "Lung", # When the Sample_name contains LUNG, assign Lung as the value in the new Type column
TRUE ~ str_extract(Sample_name, "^[:alpha:]+") # Otherwise, determine a value based on the sequence of letters that occur at the start of the Sample_name
), levels = c("AE", "HomogCtrl", "IsoCtrl", "NTC", "Lung"))) # Assign an appropriate order for the factor
# We also need to compute the averages and error
avg_abs_df <- abs_df %>%
group_by(Type) %>% # Before computing an average, identify what categories are being used to aggregate by. In this case, they're contained within the Type variable we just created.
summarize(mean_DNA_copies = mean(DNA_copies_per_ul), SEM = sqrt(var(DNA_copies_per_ul)/length(DNA_copies_per_ul)), .groups = "drop") # Create mean and SEM columns
Next, we’ll examine this ddPCR data using a bar chart created with ggplot.
# Create our bar chart using ggplot
ggplot(avg_abs_df, aes(x = Type, y = mean_DNA_copies, fill = Type)) + # This brings in the dataframe and assigns columns to various plot elements
geom_col() + # The type of plot we want to construct. This is a wrapper for geom_bar(). See ?geom_bar for more details.
geom_errorbar(aes(ymin = mean_DNA_copies - SEM, ymax = mean_DNA_copies + SEM, width = 0.5)) + # An errorbar layer to be added to the bars
scale_y_log10(labels = trans_format("log10", math_format(10^.x))) + # We could convert to log10 scale prior to plotting, or we can do it on the fly
# geom_dotplot(data = abs_df, aes(y = DNA_copies_per_ul), binaxis = "y", dotsize = .6, stackdir = "center", stackratio = 1.5) +
geom_jitter(data = abs_df, aes(y = DNA_copies_per_ul), shape = 1, size = 2) + # This adds a bit of random noise for visibility of points; shape 1 is an unfilled circle
theme_bw() + # A minimalist theme
labs(x = "Control or Specimen Type", y = "log10 mean_DNA_copies per µL", title = "Absolute abundance of negative controls and lung specimens") + # Rename labels
annotation_logticks(sides = "l") # Control whether/where logticks appear
There are several tools we can use to compare the makeup of microbial communities.
In this subsection we will explore visualization methods including PCA ordinations, biplots, rank abundance plots (bar charts), and heat maps.
As a first step, we need to create a PCA model using only the data needed to make our comparison of interest.
It’s important to compare the distribution of bacteria present in specimens and negative controls to gauge possible contamination.
# Filter data to retain controls and lung tissue only
con_lung_df <- otu_df %>% # Note: this is just one of several possible approaches to filtering to retain the correct rows.
filter((reagent_tissue == "Control") | (tissue_type == "Lung")) %>% # 65 specimens
droplevels() # Remove factor levels with no remaining observations. This prevents potential headaches in downstream data wrangling and visualizations.
rownames(con_lung_df) <- con_lung_df$specimen
# PCA of all negative controls + lung tissue
# Keep only the rows of interest for unstandardized data.
pca_df <- otu_good[con_lung_df$specimen, ]
otu_con_lung_hel <- decostand(pca_df, "hellinger") # Standardize use OTU columns only
otu_con_lung_pca <- rda(otu_con_lung_hel)
# summary() produces a list which is rather messy to print out all at once. Let's just examine the importance of the principal components by accessing $cont
summary(otu_con_lung_pca)$cont # 0.1782 0.08601 0.06845 proportion explained by first 3 axes
## $importance
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Eigenvalue 0.1448 0.06989 0.05562 0.04892 0.04276 0.02930 0.02854
## Proportion Explained 0.1782 0.08601 0.06845 0.06021 0.05262 0.03606 0.03512
## Cumulative Proportion 0.1782 0.26426 0.33271 0.39292 0.44554 0.48159 0.51672
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Eigenvalue 0.02439 0.02089 0.02041 0.01800 0.01684 0.01596 0.01459
## Proportion Explained 0.03002 0.02571 0.02511 0.02216 0.02072 0.01964 0.01796
## Cumulative Proportion 0.54674 0.57244 0.59756 0.61972 0.64044 0.66008 0.67804
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Eigenvalue 0.01392 0.01355 0.01288 0.01230 0.01192 0.01164 0.01080
## Proportion Explained 0.01714 0.01668 0.01586 0.01514 0.01467 0.01433 0.01329
## Cumulative Proportion 0.69517 0.71185 0.72771 0.74284 0.75751 0.77184 0.78513
## PC22 PC23 PC24 PC25 PC26 PC27
## Eigenvalue 0.01055 0.01041 0.009766 0.009153 0.008575 0.008428
## Proportion Explained 0.01298 0.01281 0.012019 0.011266 0.010554 0.010373
## Cumulative Proportion 0.79811 0.81092 0.822942 0.834208 0.844762 0.855135
## PC28 PC29 PC30 PC31 PC32 PC33
## Eigenvalue 0.008238 0.007763 0.007641 0.006947 0.006804 0.006603
## Proportion Explained 0.010139 0.009554 0.009405 0.008550 0.008373 0.008127
## Cumulative Proportion 0.865273 0.874827 0.884232 0.892782 0.901156 0.909283
## PC34 PC35 PC36 PC37 PC38 PC39
## Eigenvalue 0.006321 0.005732 0.005266 0.005048 0.004955 0.004656
## Proportion Explained 0.007779 0.007055 0.006481 0.006213 0.006098 0.005731
## Cumulative Proportion 0.917062 0.924116 0.930597 0.936811 0.942909 0.948640
## PC40 PC41 PC42 PC43 PC44 PC45
## Eigenvalue 0.004449 0.004364 0.004106 0.003432 0.003156 0.002749
## Proportion Explained 0.005475 0.005371 0.005054 0.004224 0.003885 0.003384
## Cumulative Proportion 0.954115 0.959486 0.964540 0.968763 0.972648 0.976031
## PC46 PC47 PC48 PC49 PC50 PC51
## Eigenvalue 0.002712 0.002402 0.002199 0.001829 0.001720 0.001580
## Proportion Explained 0.003337 0.002956 0.002706 0.002251 0.002117 0.001945
## Cumulative Proportion 0.979369 0.982325 0.985031 0.987282 0.989399 0.991343
## PC52 PC53 PC54 PC55 PC56 PC57
## Eigenvalue 0.001185 0.001140 0.001018 0.0007551 0.0006741 0.0005562
## Proportion Explained 0.001459 0.001403 0.001253 0.0009294 0.0008297 0.0006845
## Cumulative Proportion 0.992802 0.994204 0.995457 0.9963865 0.9972162 0.9979008
## PC58 PC59 PC60 PC61 PC62
## Eigenvalue 0.0004442 0.0004363 0.0003940 0.0002619 0.0001290
## Proportion Explained 0.0005467 0.0005370 0.0004849 0.0003223 0.0001588
## Cumulative Proportion 0.9984475 0.9989845 0.9994694 0.9997917 0.9999505
## PC63 PC64
## Eigenvalue 2.701e-05 1.320e-05
## Proportion Explained 3.325e-05 1.624e-05
## Cumulative Proportion 1.000e+00 1.000e+00
Next, we see the corresponding ordination (a spider plot).
# This approach combines base R plotting functionality and vegan's ordispider function, which creates the lines connecting points to the centroids.
# Unlike with ggplot(), these lines must all be run at once in order to produce the desired result.
plot(otu_con_lung_pca, type = "n", font = 2, font.lab = 2, xlab = "PC1 (17.82% Explained)", ylab = "PC2 (8.60% Explained)", main = "PCA of Lung Tissue vs Controls, colored by Type", display = "sites")
points(otu_con_lung_pca, pch = 19, col = pcapal[as.numeric(con_lung_df$reagent_tissue)]) # pch determines the point shape and col determines the colors to be used.
ordispider(otu_con_lung_pca, con_lung_df$reagent_tissue, label = TRUE) # Draw lines from centroids to the points belonging to each group.
legend("topright", levels(con_lung_df$reagent_tissue), pch = 19, col = pcapal, title = "Type") # Format and add details to legends
Let’s walk through how we arrived at this visualization, one line at a time. The plot() function creates axes based on the dimensions of the data. We also supply axis labels and a title.
plot(otu_con_lung_pca, type = "n", font = 2, font.lab = 2, xlab = "PC1 (17.82% Explained)", ylab = "PC2 (8.60% Explained)", main = "PCA of Lung Tissue vs Controls, colored by Type", display = "sites")
Next we’ll add the points with the points() function. Note that we’re setting the point shape and color with the pch and col parameters, respectively. We also have to run both of these lines at the same time to get the result we’re after.
# Unlike with ggplot(), these lines must all be run at once in order to produce the desired result.
plot(otu_con_lung_pca, type = "n", font = 2, font.lab = 2, xlab = "PC1 (17.82% Explained)", ylab = "PC2 (8.60% Explained)", main = "PCA of Lung Tissue vs Controls, colored by Type", display = "sites")
points(otu_con_lung_pca, pch = 19, col = pcapal[as.numeric(con_lung_df$reagent_tissue)]) # pch determines the point shape and col determines the colors to be used.
We finish the main elements of a spider plot by drawing lines between the points and centroids, done with ordispider(). The second argument tells the function which variable we want to use for the centroids.
# Unlike with ggplot(), these lines must all be run at once in order to produce the desired result.
plot(otu_con_lung_pca, type = "n", font = 2, font.lab = 2, xlab = "PC1 (17.82% Explained)", ylab = "PC2 (8.60% Explained)", main = "PCA of Lung Tissue vs Controls, colored by Type", display = "sites")
points(otu_con_lung_pca, pch = 19, col = pcapal[as.numeric(con_lung_df$reagent_tissue)]) # pch determines the point shape and col determines the colors to be used.
ordispider(otu_con_lung_pca, con_lung_df$reagent_tissue, label = TRUE) # Draw lines from centroids to the points belonging to each group.
Lastly, we add a legend to arrive at the final figure.
# This approach combines base R plotting functionality and vegan's ordispider function, which creates the lines connecting points to the centroids.
# Unlike with ggplot(), these lines must all be run at once in order to produce the desired result.
plot(otu_con_lung_pca, type = "n", font = 2, font.lab = 2, xlab = "PC1 (17.82% Explained)", ylab = "PC2 (8.60% Explained)", main = "PCA of Lung Tissue vs Controls, colored by Type", display = "sites")
points(otu_con_lung_pca, pch = 19, col = pcapal[as.numeric(con_lung_df$reagent_tissue)]) # pch determines the point shape and col determines the colors to be used.
ordispider(otu_con_lung_pca, con_lung_df$reagent_tissue, label = TRUE) # Draw lines from centroids to the points belonging to each group.
legend("topright", levels(con_lung_df$reagent_tissue), pch = 19, col = pcapal, title = "Type") # Format and add details to legends
vegan also supports additional ordinations such as ellipses. Simply adding an ordiellipse call with a color parameter (col) will give us the following plot. Although other plot customizations could be made, this is where we’ll stop for the purposes of this example.
# This is very similar to the previous ordinaiton except that we're drawing ellipses rather than lines connecting the points to the centroids.
plot(otu_con_lung_pca, type = "n", font = 2, font.lab = 2, xlab = "PC1 (17.82% Explained)", ylab = "PC2 (8.60% Explained)", main = "PCA of Lung Tissue vs Controls, colored by Type", display = "sites")
points(otu_con_lung_pca, pch = 19, col = pcapal[as.numeric(con_lung_df$reagent_tissue)]) # pch determines the point shape and col determines the colors to be used.
ordispider(otu_con_lung_pca, con_lung_df$reagent_tissue, label = TRUE) # Draw lines from centroids to the points belonging to each group.
ordiellipse(otu_con_lung_pca, con_lung_df$reagent_tissue, col = pcapal) # Draw ellipses around the centroids for each group.
legend("topright", levels(con_lung_df$reagent_tissue), pch = 19, col = pcapal, title = "Type") # Format and add details to legends
A biplot can help us understand which microbial features may be more associated with certain specimens or specimen groups.
# Consider whether rownames should be shown, if so they may need to be truncated for visibility
# biplot(otu_con_lung_pca, type = "t") # This will result in a lot of warnings, and though it generates the plot it's not recommended
# Address warnings with the following function, (hides warnings from 0-length arrows)
# Alternatively, set warning = FALSE in the code chunk; does not seem to work on macs
quiet_biplot <- quietly(function(ordiobj, ...) {
biplot(ordiobj, ...)
})
# Basic example: this can become difficult to read with many long specimen identifiers
# biplot_results <- quiet_biplot(otu_con_lung_pca, type = "t")
# A better approach is to use points color coded by the classes of interest as an additional layer on top of the biplot arrows
biplot_results <- quiet_biplot(otu_con_lung_pca, type = "t", font = 2, font.lab = 2, xlab = "PC1 (17.82% Explained)", ylab = "PC2 (8.60% Explained)",
main = "Biplot: Lung Tissue vs Control samples, colored by Type", display = "species", xlim = c(-0.8, 0.62)) # Note, plot zoom for xlim/ylim may need to be adjusted due to changing the display parameter
points(otu_con_lung_pca, pch = 19, col = pcapal[c(5,2)][as.numeric(con_lung_df$reagent_tissue)]) # color palette adjustd to conflict less with biplot arrows
legend("topright", levels(con_lung_df$reagent_tissue), pch = 19, col = pcapal[c(5,2)], title = "Type") # adjust legend colpal to match points
# Another example for replacing specimen labels with C or T
# biplot(prcomp(otu_con_lung_hel), xlabs=ifelse(str_detect(con_lung_df$reagent_tissue, "Control"), "C", "T")) # arrow.len = 0 will also remove warnings, at the cost of losing the arrowhead
In this experiment, we sought to compare murine specimens from Charles River and Jackson Labs.
# Filter data to retain lung tissue only
otu_lung <- dplyr::filter(otu_df, tissue_type == "Lung") %>%
droplevels() # remove levels that are no longer present due to filtering
rownames(otu_lung) <- otu_lung$specimen
# PCA of lung tissue specimens
# Keep only the rows of interest for unstandardized data.
pca_df <- otu_good[otu_lung$specimen, ]
otu_lung_hel <- decostand(pca_df, "hellinger") # Standardize use OTU columns only
otu_lung_pca <- rda(otu_lung_hel)
# summary() produces a list which is rather messy to print out all at once. Let's just examine the importance of the principal components by accessing $cont
summary(otu_lung_pca)$cont # 0.1458 0.10677 0.06776 proportion explained by first 3 axes
## $importance
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Eigenvalue 0.1190 0.08712 0.05529 0.04696 0.03976 0.03475 0.03389
## Proportion Explained 0.1458 0.10677 0.06776 0.05755 0.04873 0.04259 0.04153
## Cumulative Proportion 0.1458 0.25258 0.32034 0.37790 0.42662 0.46921 0.51075
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Eigenvalue 0.02572 0.02484 0.02339 0.02110 0.02078 0.01959 0.01914
## Proportion Explained 0.03152 0.03045 0.02867 0.02587 0.02546 0.02401 0.02345
## Cumulative Proportion 0.54226 0.57271 0.60138 0.62724 0.65271 0.67672 0.70017
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Eigenvalue 0.01861 0.01847 0.01686 0.01579 0.01481 0.01445 0.01370
## Proportion Explained 0.02280 0.02264 0.02066 0.01935 0.01815 0.01771 0.01679
## Cumulative Proportion 0.72297 0.74561 0.76627 0.78563 0.80378 0.82149 0.83828
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Eigenvalue 0.01274 0.01267 0.01224 0.01120 0.01069 0.009415 0.009183
## Proportion Explained 0.01561 0.01552 0.01500 0.01372 0.01310 0.011539 0.011255
## Cumulative Proportion 0.85390 0.86942 0.88442 0.89814 0.91124 0.922779 0.934034
## PC29 PC30 PC31 PC32 PC33 PC34
## Eigenvalue 0.008479 0.007690 0.007462 0.006652 0.006510 0.004720
## Proportion Explained 0.010392 0.009424 0.009145 0.008152 0.007978 0.005784
## Cumulative Proportion 0.944425 0.953850 0.962995 0.971147 0.979125 0.984910
## PC35 PC36 PC37 PC38 PC39
## Eigenvalue 0.004444 0.003119 0.002280 0.001560 0.0009102
## Proportion Explained 0.005447 0.003822 0.002794 0.001912 0.0011155
## Cumulative Proportion 0.990356 0.994178 0.996973 0.998885 1.0000000
# This approach combines base R plotting functionality and vegan's ordispider function, which creates the lines connecting points to the centroids.
# Unlike with ggplot(), these lines must all be run at once in order to produce the desired result.
plot(otu_lung_pca, type = "n", font = 2, font.lab = 2, xlab = "PC1 (14.58% Explained)", ylab = "PC2 (10.68% Explained)", main = "PCA of Specimen Origin, colored by Vendor", display = "sites")
points(otu_lung_pca, pch = 19, col = pcapal[as.numeric(otu_lung$vendor)]) # pch determines the point shape and col determines the colors to be used.
ordispider(otu_lung_pca, otu_lung$vendor, label = TRUE) # Draw lines from centroids to the points belonging to each group.
legend("topright", levels(otu_lung$vendor), pch = 19, col = pcapal, title = "Vendor") # Format and add details to legends
Next, we’ll examine another biplot.
# Consider whether rownames should be shown, if so they may need to be truncated for visibility
# biplot(otu_lung_pca, type = "t") # This will result in a lot of warnings, and though it generates the plot it's not recommended
# Address warnings with the following function, (hides warnings from 0-length arrows)
# Alternatively, set warning = FALSE in the code chunk; does not seem to work on macs
# see function that was created previously
biplot_results <- quiet_biplot(otu_lung_pca, type = "t")
# example for replacing specimen labels with CR or JL
# biplot(prcomp(otu_lung_hel), xlabs=ifelse(str_detect(lung_df$vendor, "Charles_River"), "CR", "JL")) # arrow.len = 0 will also remove warnings, at the cost of losing the arrowhead
Next we’ll examine two approaches to constructing heat maps, using lung specimens.
If you’re comfortable with the ggplot approach to creating visualizations, you’ll likely be able to more easily customize these plots to your liking.
# Determine the order of OTUs based on means within Lung specimens
otu_order <- names(sort(colMeans(otu_lung[, stringr::str_detect(colnames(otu_lung), "Otu")]), decreasing = TRUE))
heat_df <- otu_lung %>%
select(specimen, log_IL1a, otu_order[1:20]) %>% # Select the top 20 OTUs
pivot_longer(cols = -c("specimen", "log_IL1a"), names_to = "OTU", values_to = "Percentage", # Pivots (longer) all columns expect those specified by -str_which()
names_ptypes = list(OTU = factor(levels = otu_order[1:20])))
# group_by(OTU) %>%
# summarize(Mean_perc = mean(Percentage), .groups = "drop") %>%
# Note that the OTU levels are sorted in reverse order of abundance here - we could adjust this if desired.
ggplot(heat_df, aes(x = specimen, y = OTU)) +
geom_tile(aes(fill = Percentage)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1), plot.margin = unit(c(0.5,0.5,0.5,1),"cm")) + # Extra plot padding is necessary due to the rotation of x-axis labels.
scale_fill_gradient(low = "white", high = "dark blue") + # This allows us to specify the colors used within the heatmap
labs(title = "Relative abundance of top OTUs by Specimen", x = "Lung Tissue Specimen")
Next, we’ll explore a different approach using base R functionality. This time we’ll use dissimilarity data and contruct dendrograms to go with the plot.
# Unfortunately, filtering and other tidyverse functions strip rownames, and rownames are useful for pulling information out of matrices.
# However, there is a function to easily put the specimen column back into rownames
heat_df <- dplyr::filter(otu_df, tissue_type == "Lung") %>%
droplevels() %>%
remove_rownames() %>%
column_to_rownames("specimen") %>% # This turns the specimen column into the rownames.
select(otu_order[1:20]) # Keep only the top 20 OTUs.
# Create Bray-Curtis matrix for lung specimens
bc_lung_spec <- vegdist(select(heat_df, otu_order[1:20]), method = "bray")
# The linkage method to be used will depend on the nature of the data
clust_lung_spec <- hclust(bc_lung_spec, method = "average")
# Create Bray-Curtis matrix for OTUs
bc_lung_otu <- vegdist(t(select(heat_df, otu_order[1:20])), method = "bray")
clust_lung_otu <- hclust(bc_lung_otu, method = "average")
# Create the heatmap with dendrograms
heatmap(as.matrix(select(heat_df, str_which(colnames(heat_df), "Otu"))), Rowv = as.dendrogram(clust_lung_spec), Colv = as.dendrogram(clust_lung_otu), margins=c(5,0))
In the following series of examples, we’ll use RA plots to compare the community composition in lung speciments against negative controls. Note that the strategies we’ll use can be extended to conduct comparisons across any arbitrary groups, assuming metadata exists to identify the observations (rows) belonging to each group. For instance, we might compare four different types of tissue, or drill down into the controls and compare those subtypes.
# First we need to prepare the data: negative controls and lung tissue specimens
# Order the data by most abundant OTU in lung tissue specimens only
# Include an errorbar
# Color by phyla
ra_df <- otu_df %>% # Note at this is just one of several possible approaches to filtering to retain the correct rows
filter((reagent_tissue == "Control") | (tissue_type == "Lung")) %>% # 65 specimens
droplevels() # remove factor levels with no remaining observations - saves potential headaches in downstream wrangling and visualizations
# Determine the order of OTUs based on means within Lung specimens
otu_order <- names(sort(colMeans(filter(ra_df, tissue_type == "Lung")[, str_detect(colnames(ra_df), "Otu")]), decreasing = TRUE))
# Keep top 20 OTUs, then reshape the dataframe from wide to long format
# g_df <- ra_df %>%
# select(str_which(colnames(ra_df), "Otu", negate = TRUE), colnames(ra_df[, otu_order[1:20]])) %>% # Keep non-OTU columns, followed by the 20 most abundant OTU columns
# gather(key = OTU, value = Percentage, -str_which(colnames(ra_df), "Otu", negate = TRUE), factor_key = TRUE) %>% # factor_key is important because we care about the order (levels) in OTU
# droplevels()
# Keep top 20 OTUs, then reshape the dataframe from wide to long format
g_df <- ra_df %>%
select(str_which(colnames(ra_df), "Otu", negate = TRUE), colnames(ra_df[, otu_order[1:20]])) %>% # Keep non-OTU columns, followed by the 20 most abundant OTU columns
pivot_longer(cols = -str_which(colnames(ra_df), "Otu", negate = TRUE), names_to = "OTU", values_to = "Percentage",
names_ptypes = list(OTU = factor(levels = otu_order[1:20]))) # Ensuring OTU column is a factor is important due to the ordering expected in RA plots
# Join Phylum column from taxonomy table
tax_df <- otu_good_taxonomy[otu_good_taxonomy[,"OTU"] %in% otu_order[1:20], c("OTU", "Phylum")] # Keep only the rows from ranked OTU vector, and OTU + whatever taxonomic level to use for filling bars
tax_df$OTU <- factor(tax_df$OTU, levels = otu_order[1:20]) # set OTU factor levels to be in the same order as in g_df. If this is not done, our OTU column will be coerced to a character vector when we join (we'll lose the levels)
g_df <- inner_join(g_df, tax_df, by = "OTU")
# dimensions have gone from 65 745 to 1300 18
# Generate mean and sem for each group
agg_df <- g_df %>%
group_by(reagent_tissue, OTU, Phylum) %>%
summarize(Mean_perc = mean(Percentage), SEM = sqrt(var(Percentage)/length(Percentage)), .groups = "drop")
First we’ll create a basic rank abundance plot, then it will be gradually built up in the following examples.
(p <- ggplot(agg_df, aes(x = OTU, y = Mean_perc, fill = reagent_tissue)) + # Utilize the data and map the variables to plot parameters.
geom_col() +
facet_grid(reagent_tissue ~ .)) # Faceting replicates the plot once for each category contained within specified variable(s).
Let’s add an errorbar and improve the visual aesthetics.
# Add an errorbar
(q <- p + geom_errorbar(aes(ymin = Mean_perc - SEM, ymax = Mean_perc + SEM, width = 0.3)))
# Clear background + modify theme (rotate x-axis labels, reposition legend)
(q <- q + theme_bw() + # note that a theme_ layer added after other other theme modifications will remove these prior modifications
theme(panel.grid = element_blank(), axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "top"))
Adding Genus labels to Otu identifiers
# Add Genus levels to the OTU Number
levels(agg_df$OTU) <- fct_inorder(paste0(otu_good_taxonomy[levels(agg_df$OTU), ]$Genus, " (", levels(agg_df$OTU), ")"))
# equivalent to: # cbmbtools::paste_tax(levels(agg_df$OTU))
ggplot(agg_df, aes(x = OTU, y = Mean_perc, fill = fct_rev(reagent_tissue))) +
geom_col() +
facet_grid(fct_rev(reagent_tissue) ~ .) +
geom_errorbar(aes(ymin = Mean_perc - SEM, ymax = Mean_perc + SEM, width = 0.3)) +
theme_bw() +
theme(panel.grid = element_blank(), axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "top", plot.margin = unit(c(0.5,0.5,0.5,1),"cm")) +
scale_y_continuous(labels = unit_format(suffix = "%")) +
labs(title="Comparison of Lung Tissue and Controls", subtitle = "Ranked by lung tissue specimens", y = "% Relative Abundance")
How do we adjust the legend and facet labels?
# Although it is possible to adjust the values and labels within the ggplot workflow, perhaps the simplest approach would be to correct the reagent_tissue column values and factor levels prior to plotting
agg_df$reagent_tissue <- fct_recode(agg_df$reagent_tissue, Lung = "Tissue") %>%
fct_relevel("Lung")
ggplot(agg_df, aes(x = OTU, y = Mean_perc, fill = reagent_tissue)) +
geom_col() + # use show.legend = FALSE to remove the legend entirely
facet_grid(reagent_tissue ~ .) +
geom_errorbar(aes(ymin = Mean_perc - SEM, ymax = Mean_perc + SEM, width = 0.3)) +
theme_bw() +
theme(panel.grid = element_blank(), axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "top", plot.margin = unit(c(0.5,0.5,0.5,1),"cm"), legend.title = element_blank()) +
scale_y_continuous(labels = unit_format(suffix = "%")) +
labs(title="Comparison of Lung Tissue and Controls", subtitle = "Ranked by lung tissue specimens", y = "% Relative Abundance")
# for further customizing of the legend, try adding a guide_legend() layer
How can we use fill color to show the phyla to which each OTU belongs?
# First we need to prepare the taxonomic data - In this case we did this earlier in our overall RA data prep.
# Next, simply change the fill parameter to Phylum and rerun the ggplot code
ggplot(agg_df, aes(x = OTU, y = Mean_perc, fill = Phylum)) +
geom_col() + # use show.legend = FALSE to remove the legend entirely
facet_grid(reagent_tissue ~ .) +
geom_errorbar(aes(ymin = Mean_perc - SEM, ymax = Mean_perc + SEM, width = 0.3)) +
theme_bw() +
theme(panel.grid = element_blank(), axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "top", plot.margin = unit(c(0.5,0.5,0.5,1),"cm"), legend.title = element_blank()) +
scale_y_continuous(labels = unit_format(suffix = "%")) +
labs(title="Comparison of Lung Tissue and Controls", subtitle = "Ranked by lung tissue specimens", y = "% Relative Abundance")
# Encourage participants to run these by commenting out a layer, altering 1 parameter to develop a deeper understanding of how to build up or alter a ggplot object
Next we’ll create another relative abundance plot to examine lung tissue specimens by vendor.
# First we need to prepare the data: lung tissue
# Order the data by most abundant OTUs within Charles River vendor
# Include an errorbar
# Color by phyla
# Add Genus identifier to the OTU number where possible
ra_df <- otu_df %>% # Note at this is just one of several possible approaches to filtering to retain the correct rows
filter(tissue_type == "Lung") %>% # 40 specimens
droplevels() # remove factor levels with no remaining observations - saves potential headaches in downstream wrangling and visualizations
# Determine the order of OTUs based on means within Lung specimens from Charles River
otu_order <- names(sort(colMeans(filter(ra_df, vendor == "Charles_River")[, str_detect(colnames(ra_df), "Otu")]), decreasing = TRUE))
# Keep top 20 OTUs, then reshape the dataframe from wide to long format
g_df <- ra_df %>%
select(str_which(colnames(ra_df), "Otu", negate = TRUE), colnames(ra_df[, otu_order[1:20]])) %>% # Keep non-OTU columns, followed by the 20 most abundant OTU columns
pivot_longer(cols = -str_which(colnames(ra_df), "Otu", negate = TRUE), names_to = "OTU", values_to = "Percentage",
names_ptypes = list(OTU = factor(levels = otu_order[1:20]))) # Ensuring OTU column is a factor is important due to the ordering expected in RA plots
# Join Phylum column from taxonomy table
tax_df <- otu_good_taxonomy[otu_good_taxonomy[,"OTU"] %in% otu_order[1:20], c("OTU", "Phylum")] # Keep only the rows from ranked OTU vector, and OTU + whatever taxonomic level to use for filling bars
tax_df$OTU <- factor(tax_df$OTU, levels = otu_order[1:20]) # set OTU factor levels to be in the same order as in g_df. If this is not done, our OTU column will be coerced to a character vector when we join (we'll lose the levels)
g_df <- inner_join(g_df, tax_df, by = "OTU")
# dimensions have gone from 40 745 to 1300 18
# Generate mean and sem for each group
agg_df <- g_df %>%
group_by(vendor, OTU, Phylum) %>%
summarize(Mean_perc = mean(Percentage), SEM = sqrt(var(Percentage)/length(Percentage)), .groups = "drop")
# Add Genus levels to the OTU Number
levels(agg_df$OTU) <- fct_inorder(paste0(otu_good_taxonomy[levels(agg_df$OTU), ]$Genus, " (", levels(agg_df$OTU), ")"))
This time, we’ll create the final RA plot in one pass.
(ra_vend_plot <- ggplot(agg_df, aes(x = OTU, y = Mean_perc, fill = Phylum)) +
geom_col() + # use show.legend = FALSE to remove the legend entirely
facet_grid(vendor ~ .) +
geom_errorbar(aes(ymin = Mean_perc - SEM, ymax = Mean_perc + SEM, width = 0.3)) +
theme_bw() +
theme(panel.grid = element_blank(), axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "top", plot.margin = unit(c(0.5,0.5,0.5,1),"cm"), legend.title = element_blank()) +
scale_y_continuous(labels = unit_format(suffix = "%")) +
labs(title="Comparison of Lung Tissue across vendors", subtitle = "Ranked by Charles River lung tissue specimens", y = "% Relative Abundance"))
Stacked bars also us to view trends across samples, at the cost of potentially making it more difficult to clearly measure differences in OTUs. There are several strategies to consider when creating such a visualization. Sometimes, when there are many OTUS within a taxonomic level, it may be of interest to combine levels or combine groups of related OTUs to avoid hundreds of very small lines on the resulting bars. This improves the overall visual clarity, but there should be a compelling reason for performing the grouping operation. Additionally, just as in the Rank Abundance visualizations we’ve already seen, how categories (OTUs or otherwise) are ordered is an important component in clearly presenting one’s results.
A basic example that requires no additional processing, can be constructed using Phylum, given that g_df already has this column thanks to our prior data preparation. The following visualization allows us to see broad trends and outliers in the relative abundance of various phyla across the samples.
# Stacked bar example (Phyla) by Mouse
ggplot(g_df, aes(x = mouse, y = Percentage, fill = Phylum)) +
geom_bar(position = "fill", stat = "identity") +
theme_bw() +
theme(panel.grid = element_blank(), axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "top", plot.margin = unit(c(0.5,0.5,0.5,1),"cm"), legend.title = element_blank()) +
labs(title="Comparison of Lung Tissue across vendors", subtitle = "Ranked by Charles River lung tissue specimens, fill by Phyla", y = "Relative Abundance")
What would happen if we used the OTU column to fill corresponding portions of the bars? Note that because we’ve excluded all but the top 20 OTUs, the position parameter needs to changed from what was used in the Phyla visualization. Without changing this, the implication is that these 20 OTUs show 100% of the data (which is not true).
# Stacked bar example (OTU) by Mouse
ggplot(g_df, aes(x = mouse, y = Percentage, fill = OTU)) +
geom_bar(position = "stack", stat = "identity") + # Note parameter adjustment
theme_bw() +
theme(panel.grid = element_blank(), axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "top", plot.margin = unit(c(0.5,0.5,0.5,1),"cm"), legend.title = element_blank()) +
scale_y_continuous(labels = unit_format(suffix = "%")) +
labs(title="Comparison of Lung Tissue across vendors", subtitle = "Ranked by Charles River lung tissue specimens, fill by top 20 OTUs", y = "% Relative Abundance")
Suppose we didn’t care for the white space, which represents the abundance of OTUs outside the top 20. All of the OTU data could be shown instead. In this case, careful consideration needs to be given for how to effectively represent the results. It’s clear that showing hundreds or thousands of OTUs in a single bar is not practical, and may also distract from other signals. Additionally, including hundreds of OTUs in a legend creates an plot that becomes difficult to examine (expanding the fig.height and fig.width parameters in the respective code chunk label can help). Most likely, many of the OTUs will need to be collapsed into an ‘Other’ category, or into related groups. The next code chunk contains an example of this, along with additional visual aesthetic modifications.
# reprocess the data to retain all OTUs by keeping the entire otu_order vector
g_df_redux <- ra_df %>%
select(str_which(colnames(ra_df), "Otu", negate = TRUE), colnames(ra_df[, otu_order])) %>% # Keep non-OTU columns, followed by the OTUs in order of abundance
pivot_longer(cols = -str_which(colnames(ra_df), "Otu", negate = TRUE), names_to = "OTU", values_to = "Percentage",
names_ptypes = list(OTU = factor(levels = otu_order))) # Ensure OTU column is a factor is important due to the ordering expected in RA plots
g_df_redux$OTU <- g_df_redux$OTU %>%
fct_relabel(~ case_when(.x %in% otu_order[1:20] ~ .x, # Retain the top 20 as independent levels
TRUE ~ "Other"), levels = c(otu_order[1:20], "Other")) # Any OTU besides the top 20 becomes 'Other'
# Sort the data by the factor used for fill to obtain desired appearance. This can take some experimentation.
g_df_redux <- g_df_redux %>%
arrange(OTU)
# Stacked bar example (OTU) by Mouse
ggplot(g_df_redux, aes(x = mouse, y = Percentage, fill = fct_rev(OTU))) +
geom_bar(position = "stack", stat = "identity") +
theme_bw() +
theme(panel.grid = element_blank(), axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "top", plot.margin = unit(c(0.5,0.5,0.5,1),"cm"), legend.title = element_blank()) +
scale_y_continuous(labels = unit_format(suffix = "%")) +
labs(title="Comparison of Lung Tissue across vendors", subtitle = "Ranked by Charles River lung tissue specimens, fill by OTU", y = "% Relative Abundance") +
scale_fill_manual(values = c("#A0A0A0", hue_pal()(20))) # Former is hexcode for a grey shade
Next we’ll explore how to conduct significance testing.
The following is an example of a permanova implementation using vegan::adonis().
# Permanova
set.seed(8947) # setting a seed allows for reproducibility. Both of these lines need to be run together for it to work.
adonis(otu_lung_hel~otu_lung$vendor, method="euclidean", permutations = 10000) # 9.999e-05 ***
##
## Call:
## adonis(formula = otu_lung_hel ~ otu_lung$vendor, permutations = 10000, method = "euclidean")
##
## Permutation: free
## Number of permutations: 10000
##
## Terms added sequentially (first to last)
##
## Df SumsOfSqs MeanSqs F.Model R2 Pr(>F)
## otu_lung$vendor 1 3.175 3.1754 4.2122 0.09979 9.999e-05 ***
## Residuals 38 28.646 0.7539 0.90021
## Total 39 31.822 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# The left hand side of the formula is the transformed data, and the right hand side is a vector (column) container the group identifiers of interest.
Because this may take some time to finish (about 4.5 minutes on a powerful machine), the following code is commented out to prevent it slowing down the creation of our report, which we’ll get to later on. To run this section at a future point that’s convenient for you, begin by simply highlighting all of the lines in the code chunk and then using CTRL + Shift + C to uncomment the code (mac). The code responsible for loading mvabund in section 2.1 will also need to be uncommented. [additional explanation, topics here]. For code chunks that take a while to execute, whether due to statistical analyses or visualizations, it may be beneficial to set the code chunk cache parameter to TRUE. This will enable you to recreate your report much more quickly when making small changes elsewhere.
# a <- Sys.time() # Sometimes this can take a while, so it might be useful to keep track of the time with start/stop points to allow us to compute the duration.
# otu_mva <- mvabund(otu_lung[,str_which(colnames(otu_lung), "Otu")]) # select only lung specimens and Otu columns
# otu_ven_many <- manyglm(otu_mva ~ otu_lung$vendor)
# otu_mva_results <- anova(otu_ven_many, p.uni="adjusted")
# b <- Sys.time()
# b - a
#
# plot(otu_ven_many) # Residual plot
# otu_mva_results$table # Signif
#
# # Example output
# # Analysis of Deviance Table
# #
# # Model: manyglm(formula = otu_mva ~ otu_lung$vendor)
# #
# # Multivariate test:
# # Res.Df Df.diff Dev Pr(>Dev)
# # (Intercept) 39
# # otu_lung$vendor 38 1 797.9 0.001 ***
# # ---
# # Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The following is a Cage example using vegan::cca(). For more examples of CCA and other ordination techniques, check out the Intro to Vegan.
# Run anova on our hellinger transformed data. We'll also need to pass in one or more vectors of metadata corresponding to the constraining variables.
lung_meta <- dplyr::filter(otu_df, specimen %in% rownames(otu_lung)) %>%
dplyr::select(-str_which(colnames(otu_df), "Otu"))
cage_cca <- cca(otu_lung_hel ~ lung_meta$cage) # First create the cca model object
anova(cage_cca) # Analyze the model
[Descriptive text here]
While we could make use of our previous mvabund results and examine whether there are any significant taxa, for time considerations we’ll show only the code to do so.
# # The following lines could be run with an mvabund model
# # Restructure, sort, filter to retrieve significant OTUs
# otu_ven_sigpvals <- data.frame(OTU = colnames(otu_mva_results$uni.p), uni_p = otu_mva_results$uni.p[2,]) %>%
# filter(uni_p <= .05) %>% # The threshold here can be adjusted to the desired level.
# droplevels() %>%
# arrange(uni_p) # Orders the pvalues
#
# # label OTUS
# otu_ven_sigpvals$OTU <- fct_inorder(paste0(otu_good_taxonomy[levels(otu_ven_sigpvals$OTU), ]$Genus, " (", levels(otu_ven_sigpvals$OTU), ")"))
Here is an example of formatting a dataframe (in this case containing pvalues) for display in a report.
# Note that this is an example constructed for illustration purposes only
mva_sigpvals <- data.frame(Family = c("Lachnospiraceae", "Erysipelotrichaceae", "Porphyromonadaceae", "Clostridiaceae_1", "Streptococcaceae", "Lactobacillaceae"), uni_p = c(0.001, 0.003, 0.015, 0.016, 0.017, 0.017))
knitr::kable(mva_sigpvals, caption = "Vendors, Significant OTUs (.05)", align = "l") # Check out ?kable for further specifications.
| Family | uni_p |
|---|---|
| Lachnospiraceae | 0.001 |
| Erysipelotrichaceae | 0.003 |
| Porphyromonadaceae | 0.015 |
| Clostridiaceae_1 | 0.016 |
| Streptococcaceae | 0.017 |
| Lactobacillaceae | 0.017 |
# # This example would use the output from previous code chunk rather than our arbitrary dataframe
# knitr::kable(otu_ven_sigpvals, caption = "Vendors, Significant OTUs (.05)", align = "l")
Note: this depends on whether we’re using regression or classification. A generic variable importance plot is shown for one iteration of the model. For classification as shown below, we prefer to use MeanDecreaseAccuracy as our feature importance metric rather than MeanDecreaseGini, due to increased bias in the latter.
# Aggregate the OTU data into family. This requires a fair amount of data reshaping and joining operations. Wrapping these steps into a function is ideal so that you don't have to retrace every step for simply changing a taxonomic level for aggregation (e.g. Phylum vs Family).
fam_df <- otu_df %>%
dplyr::select(specimen, contains("Otu")) %>% # Keep only the specimen identifier and OTU columns
tidyr::pivot_longer(cols = -stringr::str_which(colnames(.), "Otu", negate = TRUE), names_to = "OTU", values_to = "Percentage") %>% # Take data into long form in preparation for join operation
dplyr::inner_join(dplyr::select(otu_good_taxonomy, OTU, Family), by = "OTU") %>% # Determine which columns to join from taxonomy table. In this example, we're adding the Family column.
dplyr::group_by(specimen, Family) %>%
dplyr::summarize(Percentage = sum(Percentage), .groups = "drop") %>% # Note that this step removes any metadata columns which aren't grouped by
tidyr::spread(key = Family, value = Percentage) %>% # Return data from long form to wide
dplyr::inner_join(dplyr::select(otu_df, -contains("Otu")), by = "specimen") %>%
dplyr::select(specimen, reagent_tissue:IL1a, everything()) # Pull metadata coluns to the front, all Family columns follow afterwards
Next, we’ll run a basic randomForest.
# For lung specimens, retain only vendor & OTU columns
rf_df <- dplyr::filter(fam_df, tissue_type == "Lung") %>%
dplyr::select(vendor, levels(otu_good_taxonomy$Family)) %>%
droplevels()
a <- Sys.time()
set.seed(3462)
rf_vendors <- randomForest(vendor ~ ., data = rf_df, importance = TRUE) # Setting the importance parameter to TRUE allows us to obtain the variable importance info we're after.
b <- Sys.time()
b-a
## Time difference of 0.06178498 secs
# Basic variable importance plot
varImpPlot(rf_vendors, scale = FALSE)
In order to better gauge the robustness of these findings, let’s run the random forest model 100 times (note: this took around 30 seconds on a powerful machine).
a <- Sys.time()
nforest <- 100
set.seed(3462)
rf_list <- vector("list", nforest) # Initialize a list to store the randomForests
for (i in seq_along(1:nforest)) { # Loops nforest number of times, creating a model and saving it into the list each time.
rf_list[[i]] <- randomForest(vendor ~ ., data = rf_df, importance = TRUE)
}
b <- Sys.time()
b-a
## Time difference of 6.142403 secs
# Extract the feature importances
fi_df <- purrr::map(rf_list, ~{.[["importance"]][, 3]}) %>%
unlist() %>%
data.frame(Feature = names(.), MeanDecreaseAccuracy = .)
meas <- colnames(fi_df)[2] # "MeanDecreaseAccuracy"
# Identify the top features
top_feat <- fi_df %>% dplyr::group_by(Feature) %>%
summarize(`:=`(!!meas, mean(.data[[meas]])), .groups = "drop") %>%
arrange(desc(.data[[meas]])) %>%
head(20) %>% # Top 20 features
droplevels()
# Retain only the top features
fi_df <- dplyr::filter(fi_df, Feature %in% top_feat$Feature) %>%
droplevels()
fi_df$Feature <- forcats::fct_reorder(fi_df$Feature, fi_df[, 2], .fun = mean, .desc = TRUE)
Next we’ll create boxplots of the feature importance over the 100 random forest runs.
ggplot(fi_df, aes(x = Feature, y = fi_df[, 2])) +
geom_boxplot(outlier.alpha = 0.4) +
theme_bw() +
labs(y = meas, title = paste0("Feature Importance: vendor ~ Family microbiome"), subtitle = paste0("Model: rf_df \nRuns: ", length(rf_list), ", mtry: ", rf_list[[1]]$mtry)) +
coord_flip() + # Flip the axes for improved readability
scale_x_discrete(limits = rev(levels(fi_df$Feature)))
In this section we’ll explore several measures of community diversity and examine how they can be used within our analyses.
First, we’ll look at the diversity within specimens. We are interested in measures of:
* Richness - how many unique taxa are there within a specimen?
* Evenness - how evenly distributed are the taxa?
* Dominance - what is the relative abundance of the dominant (largest %) taxa?
To measure α-diversity, we’ll compute the Shannon diversity using the diversity() function, which calculates the shannon index by default. Then we’ll select the observations of interest and compute summary statistics. The formula for deriving the Shannon index is shown below:
\[H' = -\sum_{i=1}^{R}p_{i} \ln p_{i}\]
# Create dataframe w/ Sample name, Shannon, Total Reads, Dominance on a per sample basis
alpha_div_df <- bind_cols(specimen = otu_df$specimen,
Shannon = diversity(otu_good), # Shannon diversity per specimen
Total_Reads = pmap_dbl(data.frame(otu_good), sum), # rowwise sum (by specimen)
Dominance = pmap_dbl(dplyr::select(otu_df, contains("Otu")), max)) %>% # Dominance: max % of top OTU per specimen
# mutate(Dominance = round(Dominance, 3)) # %>% # optional rounding; pipe(s) will need to be connected
left_join(otu_df[,1:6], by = "specimen") # Join first 6 metadata columns, allows for easier filtering
shan_lung <- alpha_div_df %>%
dplyr::filter(tissue_type == "Lung") %>% # Retain only lung specimens for visualization
droplevels()
# Create Shannon diversity summary statistics for bars/error bars
agg_shan_lung <- shan_lung %>%
group_by(vendor) %>%
summarize(Mean_Sh = mean(Shannon), SEM_Sh = sqrt(var(Shannon)/length(Shannon)), .groups = "drop") # Create summary statistics
To visualize the results, we’ll plot the specimens’ α-diversities and means using ggplot().
ggplot(agg_shan_lung, aes(x = vendor, y = Mean_Sh, fill = vendor)) + # This brings in the aggregated dataframe and assigns columns to various plot elements
geom_col(show.legend = FALSE) + # Add bars representing the mean of specimens' Shannon diversity by vendor
geom_jitter(data = shan_lung, aes(x = vendor, y = Shannon), width = .2, alpha = .5, show.legend = FALSE) + # Use specimen level data, map columns to scatterplot attributes
geom_errorbar(aes(ymin = Mean_Sh - SEM_Sh, ymax = Mean_Sh + SEM_Sh), alpha = .5) + # Add an errorbar layer
theme_bw() + # A minimalist theme
labs(y = "Shannon Diversity", title = "Shannon Diversity of Lung Specimens by Vendor") # Adjust axis label and title
We can also test the difference between the means of these groups for significance.
# T-Test for the two vendors. This approach would also work if there were 3 or more groups.
TukeyHSD(aov(shan_lung$Shannon ~ shan_lung$vendor))
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = shan_lung$Shannon ~ shan_lung$vendor)
##
## $`shan_lung$vendor`
## diff lwr upr p adj
## Jackson_Labs-Charles_River 1.139108 0.6790736 1.599143 1.28e-05
Rarefaction curves can also be used to explore α-diversity. First, the data needs to be prepared.
# Generate the rarefaction data, combine with metadata, transform for plotting, and aggregate
rarefy_agg <- rarefy(dplyr::select(otu_raw, -label, -numOtus), sample = 1:1000) %>% # subsample for each number in 1:1000; an alternative is to set sample = to the integer length of primary interest, e.g. 1000
data.frame(row.names = rownames(otu_good)) %>% # Use cleaned up rownames from otu_good rather than messy names from otu_raw
rownames_to_column("specimen") %>%
pivot_longer(-specimen, names_to = "Sample_Seq_Num", names_prefix = "N", names_transform = list(Sample_Seq_Num = as.numeric), values_to = "Unique_Species") %>% # Transform data from long to wide format
left_join(dplyr::select(otu_df, specimen, reagent_tissue, vendor, tissue_type), by = "specimen") %>% # Join reagent_tissue, vendor, and tissue_type columns
dplyr::filter((reagent_tissue == "Control") | (tissue_type == "Lung")) %>% # Keep only lung specimens and negative controls
group_by(vendor, Sample_Seq_Num) %>%
summarize(Mean_Species = mean(Unique_Species), SEM = sqrt(var(Unique_Species)/length(Unique_Species)), .groups = "drop") # Create summary statistics by vendor (negative controls are considered 'None')
# write.table(rarefy_raw,file = "rarefy.txt") # Create intermediate table
To visualize the rarefaction curves, we’ll use three geom_line layers as follows:
ggplot(rarefy_agg, aes(x = Sample_Seq_Num, y = Mean_Species, color = vendor)) +
geom_line(alpha = 0.5) + # Line for the mean
geom_line(aes(y = Mean_Species - SEM), linetype = "dashed") + # Lower standard error
geom_line(aes(y = Mean_Species + SEM), linetype = "dashed") + # Upper standard error
theme_bw() +
theme(panel.grid = element_blank(), legend.position = "top", legend.title = element_blank()) +
scale_color_hue(labels = c("Charles River", "Jackson Labs", "Negative Controls")) + # Fix legend labels
labs(x = "Number of Sequences Sampled", y = "Unique Species", title = "Rarefaction curves for lung specimens and negative controls")
Relative abundance plots can also be used here.
Next we’ll examine diversity across specimens.
Revisiting our PCA ordination of lung specimens by vendor:
# This approach combines base R plotting functionality and vegan's ordispider function, which creates the lines connecting points to the centroids.
# Unlike with ggplot(), these lines must all be run at once in order to produce the desired result.
plot(otu_lung_pca, type = "n", font = 2, font.lab = 2, xlab = "PC1 (14.58% Explained)", ylab = "PC2 (10.68% Explained)", main = "PCA of Specimen Origin, colored by Vendor", display = "sites")
points(otu_lung_pca, pch = 19, col = pcapal[as.numeric(otu_lung$vendor)]) # pch determines the point shape and col determines the colors to be used.
ordispider(otu_lung_pca, otu_lung$vendor, label = TRUE) # Draw lines from centroids to the points belonging to each group.
legend("topright", levels(otu_lung$vendor), pch = 19, col = pcapal, title = "Vendor") # Format and add details to legends
Bray–Curtis dissimilarity is a statistic used to quantify the compositional dissimilarity between two different sites, based on counts at each site. A score of 1 represents complete dissimilarity between two specimens, meaning no OTUs are shared between the specimens, for example. On the other hand, a score of 0 signifies complete similarity. The following formula provides the index of dissimilarity, though we’ll use vegdist to handle computation:
\[BC_{IJ} = 1 - \frac{2C_{IJ}}{S_{i} + S_{J}}\]
# This helper function will be used to disambiguate columns associated with paired BC comparisons
# For example, the reagent_tissue for sample 1 could be different than sample 2, thus an identifier column
# might be necessary for each in order to filter the results as desired.
bc_rename <- function(colname) {
if(str_detect(colname, ".x$")) { # For .x, simply remove the suffix
return(str_remove(colname, ".x$"))
} else if (str_detect(colname, ".y$")) {
return(str_replace(colname, ".y$", "_comparison")) # For .y, replace with _comparison suffix
} else { # Otherwise name does not need adjustment
return(colname)
}
}
# Goal: Full BC matrix converted to a long dataframe of all possible specimen comparisons.
# We could also add a step to filter for specimens of interest
sub_df <- otu_df %>% # First, subset to keep rows of interest for the Bray-Curtis comparisons
dplyr::filter(reagent_tissue == "Tissue") %>% # Exclude controls
droplevels()
bc_df <- sub_df %>%
dplyr::select(contains("Otu")) %>% # retain OTU columns only
vegdist(diag = TRUE, upper = TRUE) %>% # method = "bray" by default
as.matrix() %>%
as.data.frame(row.names = sub_df$specimen) %>%
setNames(sub_df$specimen) %>% # add column names
rownames_to_column("specimen") %>% # create specimen column from rownames
pivot_longer(cols = -specimen, names_to = "comparison", values_to = "BC") %>% # transform from wide to long
dplyr::filter(specimen != comparison) %>% # remove comparisons against the same specimen (the diagonal in the original distance matrix)
left_join(dplyr::select(otu_df, specimen:tissue_type, cage), by = "specimen") %>% # join cols for subsetting by specimen1
left_join(dplyr::select(otu_df, specimen:tissue_type, cage), by = c("comparison" = "specimen")) %>% # join cols for specimen2 (comparison)
rename_with(~ map_chr(., bc_rename)) # use the custom renaming function defined above
# Here we'll subset to retain only the within vendor + tissue comparisons. We could also subset even further to group by `shipments` within vendors.
bc_tissue_vend_df <- bc_df %>%
dplyr::filter(tissue_type == tissue_type_comparison, vendor == vendor_comparison) %>% # Comparisons of interest are within tissue + vendor groupings
droplevels() %>%
group_by(specimen, vendor, tissue_type) %>% # Carefully consider which variables to use for aggregation
summarize(Mean_BC = mean(BC), SEM_BC = sqrt(var(BC)/length(BC)), .groups = "drop")
# Plot within vendor tissue-tissue BC comparisons
ggplot(bc_tissue_vend_df, aes(x = vendor, y = Mean_BC, fill = vendor)) +
geom_boxplot() + # alternatively, use a geom_col and an errorbar layer
geom_point(size=1, alpha=0.5, position = position_jitter(width = 0.175, seed = 8237)) + # add 'noisy' points for specimens (adjust width to improve visibility on discrete x-axis)
theme_bw() + # A minimalist theme
labs(y = "Mean Bray-Curtis", title = "Bray-Curtis Dissimilarity: Tissue samples by vendor") + # Adjust axis label and title
facet_grid(. ~ tissue_type) +
theme(panel.grid = element_blank(), legend.title = element_blank()) +
expand_limits(y = c(0,1)) # Start y-axis at 0
Future example. [Descriptive text here]
Usually it’s not enough that you’ve generated statistics or created visualizations. You also need to be able to share these results. Fortunately, there are a variety of ways to do this.
Here is an example of exporting otu_lung to a .csv file, which can then be opened in Excel or your spreadsheet editor of choice. Note that depending on your locale (whether , or ; is the expected separator), you may need to use write.csv2() to obtain the desired results. As an additional exercise, try changing the name of the dataframe or the name of the output file.
# This will save a new file called otu_lung.csv in the current working directory
write_csv(otu_lung, "otu_lung.csv")
The following code chunk creates dataframes aggregated to various taxonomic levels and saves them as .csv files in the project directory. Bonus: instead of using the following loop to write these tables to files, they could be added to the global environment instead.
# Create dataframes aggregated at the specified taxonomic levels
for(taxon in c("Phylum", "Class", "Order", "Family")) {
otu_df %>% # Contains results of otu_good after decostand
dplyr::select(specimen, contains("Otu")) %>% # retains Specimen (or optionally adjust to Sample_name) and Otu cols
pivot_longer(cols = -stringr::str_which(colnames(.), "Otu", negate = TRUE), names_to = "OTU", values_to = "Percentage") %>% # wide to long transformation
inner_join(dplyr::select(otu_good_taxonomy, OTU, all_of(taxon)), by = "OTU") %>% # join taxonomy table
group_by(specimen, .data[[{taxon}]]) %>% # prepare to aggregate at specified taxonomic level
summarize(Percentage = sum(Percentage), .groups = "drop") %>%
spread(key = taxon, value = Percentage) %>% # transform to wide
write_csv(paste0("CRJL_", tolower(taxon), "_table.csv")) # writes to .csv
# Note, above line can commented and line below uncommented to add these dataframes to the global env
# assign(x = paste0(substr(tolower(taxon), start = 1, stop = 3), "_df_ex"), pos = 1) # assigns values to df_names in curent environment (presumed to be global env)
}
Next, we’ll save a plot. There are several methods for doing this and one approach utilizing ggsave is shown in the following code chunk.
Note that we will also have access to a folder (in this case: Lung_Microbiome_Tutorial_files/figure-html/) with all of the plot images used in rendering the final report after finishing step 7.3, so these images could be retrieved from this folder after the knitting process is completed. If your report generates many figures, this might be a good way to go.
Additionally, right clicking on an image displayed inline (within Lung_Microbiome_Tutorial.Rmd) will bring up an option to save the image.
Yet another way to save a plot is to first save a plot to an object, for instance p, then type p followed by return in the console. Finally, navigate to the plot tab in RStudio and click the export drop down to reveal several methods for exporting the visualization that was just created from running p.
# The optional `path` parameter can be added onto the filename, used to denote a different directory. Use ?ggsave to check out other parameters
ggsave(filename = "vendor_relabund.pdf", plot = ra_vend_plot, width = 30, height = 30, units = "cm")
If all goes well, we should also be able to create a report from the narrative and code contained within this R Markdown document that we’ve been working on. Click the Knit button (look for the blue ball of yarn) near the top of RStudio interface to generate the default report type, which is set to HTML for this document. An HTML document containing this report will be created in the current working directory and it will display as a popup window or in the RStudio Viewer pane. We can easily change the type of file that is created by altering the output type in the header at the top of this document.
Note that there will also be a figure-html folder containing all of the plots created through the code chunks in this document. For example, if the Rmd file (the one we’re using) is called Lung_Microbiome_Tutorial, then the figures would be located in: Lung_Microbiome_Tutorial_files/figure-html.
Congratulations! You’ve made it to the end of this introduction to lung microbiome analyses. Now that you’ve seen the basics, demonstrated between specimens and controls as well as within lung specimens, can you create additional analyses using the cage variable? Feel free to explore explore this and examine other possible relationships in the data on your own.
Resources for learning more about various topics.
Although we used mothur to process the 16S data prior to this analysis, alternatives include QIIME2 and DADA2. Beyond processing, each also provides varying support for exploration and analysis tasks during or post-processing.
This section contains some guides and further reading about R and key libraries that facilitate your work.
R for Data Science
R Markdown tutorial
R Markdown: The Definitive Guide
tidyverse
Ggplot2
- Includes several reference guides and learning sources
Pivoting
If you’re new to git/github and version control in general, there are a variety of resources to help get you started:
git
github tutorial
cbmbtools library
- This package contains functions for assisting in 16S microbiome analysis.